Convolutional neural networks are designed to handle
grid-structured data, where there is a strong local dependency
between neighboring items of the grid.
Examples include images, text, and sound.
Here we focus only on image data.
For grid-structured data, convolutional neural networks are
computationally more efficient than fully connected neural networks.
This is primarily because convolutional neural
networks require fewer
parameters than fully connected networks.
Image data exhibits two key properties, namely translation invariance and locality.
Translation invariance: The classification decision
for each image is independent of the position
of the animal on the image. A cat is a cat irrespective of whether it
appears at the top or at the bottom of an image.
Locality: The classification decision does not
really depend on a pixel that is far away from
the animal on the image. A cat is still a cat irrespective of whether
far away pixels correspond to a building or a tree.
Recall that when using fully connected neural networks for
images, the first step is to convert each image to an input
feature vector. By doing this we may lose both of the above-mentioned structural
properties.
Filtering
To understand convolutional neural networks it is helpful to have
a basic understanding of filtering, a
well-established field in signal processing.
Filtering is a method or process
that removes certain unwanted information from a signal
or an image, or alternatively enhances it by
accentuating certain information.
Filtering applies mathematical
operations to input images, with the most common operation being the
convolution. A convolution can be
viewed as an operation between two mathematical objects, such as
two matrices, where one represents an
image and the other a filter.
Each filter is often custom designed depending on the specific
task at hand. For example, a popular filter, called the Sobel filter, is useful for the task of
edge detection. See the following figure, where in (b)
we detect the vertical edges and in
(c) we detect the horizontal edges. By
adding the outputs of these two filtering operations, we get the final
image shown in (d) which captures most of the vertical and horizontal
edge information.
Note: Convolutional neural networks build upon the
classic ideas of filtering using convolutional
layers. Each convolutional layer is made up of one or more filters, also known as kernels, each of which aims to extract a
particular feature of the input to the layer.
Example CNN Network: VGG19
To get a feel for convolutional neural networks let us consider the
task of classifying color images using the VGG19 model
(VGG stands for Visual Geometry Group,
the group at Oxford University that created the network).
This network is designed to train on images in
ImageNet where the dimension of each image is \(3 \times 224 \times 224\), here \(3\) is the number of channels (red, green,
and blue), and \(224\times 224\)
specifies the pixel dimensions.
Since there are \(1,000\) classes
in ImageNet, the output of VGG19 is a probability vector of length \(1,000\).
The VGG19 model has about 144
million parameters.
Practical 1:
Open Tutorial 1 of CNN available on the workshop GitHub page.
Alternatively, click
here.
Save a copy of this in your Google Colab.
In this exercise, we input two images, which are not part of the
ImageNet dataset, to see what is the output of a pretrained
VGG19.
Note that training of VGG19 takes many days on a regular
computer, and hence we use a pretrained network.
The Convolution Operation
As mentioned earlier, the convolution
operation is a key component of convolutional neural
networks.
A convolution can be viewed as an operation on two functions
which creates a third function. In finite domains, these functions may
be represented as vectors, matrices, or tensors.
The convolution operation
between two matrices \(W\) and \(x\) is denoted by \(W\star x\). Suppose \(K_h \times K_v\) and \(M_h \times M_v\) are the dimensions of
\(W\) and \(x\), respectively.
In the context of CNNs, \(W\)
represents a kernel at a layer and
\(x\) represents an input to that layer. Further, it is usual
that \(K_h \leq M_h\) and \(K_v \leq M_v\).
The convolution \(W\star x\) is a
matrix of dimension \((M_h - K_h + 1) \times
(M_v - K_v + 1)\), where for output at \(i' = 1, \dots, M_h - K_h + 1\) and
\(j' = 1, \dots, M_v - K_v + 1\),
the convolution action is \[\begin{equation}
(W\star x)_{i',j'} = \sum_{i=1}^{K_h}\sum_{j=1}^{K_v} W_{i,j}\, x_{i'+i-1,\, j'+j-1}.
\end{equation}\]
This convolution operation is illustrated below, where \(W\) has dimension \(3\times 3\) and \(x\) has dimension \(6\times 7\). Thus \(W\star x\) has dimension \((6-3+1)\times(7-3+1) = 4\times 5\).
Convolution operation.
In particular, the first element of the output matrix \(z\) is computed as shown in the following
figure.
Computing \(z_{1,1}\). Here, \(\odot\) denotes element-wise product
between two matrices and \(\sum\)
denotes summation of all the elements of a matrix.
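The computation above can be sketched in plain Python; the function name `conv2d` and the list-of-lists representation are ours, for illustration only:

```python
def conv2d(x, W):
    """Valid cross-correlation of image x with kernel W, as used in CNNs.

    x has dimension M_h x M_v and W has dimension K_h x K_v; the output
    has dimension (M_h - K_h + 1) x (M_v - K_v + 1).
    """
    M_h, M_v = len(x), len(x[0])
    K_h, K_v = len(W), len(W[0])
    out = []
    for i in range(M_h - K_h + 1):
        row = []
        for j in range(M_v - K_v + 1):
            # element-wise product of W with the window of x at (i, j), then sum
            row.append(sum(W[a][b] * x[i + a][j + b]
                           for a in range(K_h) for b in range(K_v)))
        out.append(row)
    return out

# A 3x3 kernel on a 6x7 input gives a 4x5 output, as in the illustration.
x = [[1] * 7 for _ in range(6)]
W = [[1] * 3 for _ in range(3)]
z = conv2d(x, W)
assert len(z) == 4 and len(z[0]) == 5
assert z[0][0] == 9
```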
Edge Detection
Recall the edge detection example presented earlier. We have seen
that both horizontal and vertical edges
are detected via convolution operation
to obtain the final output.
The filter \(W\) for detecting the
horizontal edges is given by \[\begin{equation}
\begin{bmatrix}
-1 & -2 & -1\\
0 & 0 & 0\\
1 & 2 & 1\\
\end{bmatrix}.
\end{equation}\]
The filter \(W\) for detecting the
vertical edges is given by \[\begin{equation}
\begin{bmatrix}
-1 & 0 & 1\\
-2 & 0 & 2\\
-1 & 0 & 1\\
\end{bmatrix}.
\end{equation}\]
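As a small illustration, we can apply the vertical-edge Sobel filter to a tiny synthetic image containing one vertical intensity edge; the response is nonzero only near the edge. The helper `conv2d` is a minimal valid cross-correlation written here for illustration:

```python
def conv2d(x, W):
    """Minimal valid cross-correlation helper (illustrative)."""
    K_h, K_v = len(W), len(W[0])
    return [[sum(W[a][b] * x[i + a][j + b]
                 for a in range(K_h) for b in range(K_v))
             for j in range(len(x[0]) - K_v + 1)]
            for i in range(len(x) - K_h + 1)]

sobel_vertical = [[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]]

# 5x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

response = conv2d(image, sobel_vertical)
# The response is nonzero only in the columns straddling the intensity jump.
assert response[0] == [0, 4, 4, 0]
```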
Note 1:
The filtering operation at each layer in a CNN is similar to
classical filtering, such as edge detection above.
However, unlike classical filtering, filters in CNNs
are learned rather than
designed.
A training dataset is used for learning the filters before using
the network for image processing. Here, learning a filter means learning
the entries of the matrix that represents the filter, and these entries
are called weights as the filters play a similar role
to the weight matrices of a fully connected neural networks.
Note 2: Since filters in CNNs are learned rather
than designed, there is no need for a flipping operation;
instead, we directly learn the flipped filter.
Building a Convolutional Layer
In the earlier part of this workshop, we have seen the
construction of general fully connected neural
networks, each of which consists of a series of layers where
every neuron in a given layer is connected to every neuron in the next
layer.
Fully connected networks are general in the sense that they are
structure agnostic, that is, there are
no specific assumptions made about the structure of the
input. This property makes fully connected neural networks versatile.
However, fully connected networks are inadequate when dealing with inputs that
have rich structural properties, such as
images.
Convolutional neural networks make use of the aforementioned two
key properties of grid-structured data, namely translation
invariance and locality.
As a result, the number of parameters to learn in convolutional
neural networks is significantly
smaller than that of corresponding fully connected neural
networks.
Can we use fully connected networks for image data?
Answer: Yes!
To use fully connected neural networks on images \(x\) of dimension \(M_h^{[0]} \times M_v^{[0]}\), they are
first converted to a vector of length \(M_h^{[0]}\cdot M_v^{[0]}\) in a consistent
manner.
Without loss of generality we can continue to index the elements
of the vector \(x\) via tuples \((i, j) \in \{1, \dots, M_h^{[0]}\} \times \{1,
\dots, M_v^{[0]}\}\).
The following figure shows matrix to vector conversion before
inputting it to a fully connected layer.
A consistent matrix-to-vector conversion operation.
The following figure shows reconstruction of the matrix from the
corresponding vector. This operation can be used on outputs of a fully
connected layer if the length of the output is a product of two
integers.
Reproducing matrix from the vector.
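The matrix-to-vector conversion and its inverse can be written in a few lines of plain Python; row-major order is one consistent choice, and the function names are ours:

```python
def mat_to_vec(x):
    """Flatten an M_h x M_v matrix to a vector of length M_h * M_v (row-major)."""
    return [v for row in x for v in row]

def vec_to_mat(v, M_h, M_v):
    """Rebuild the matrix, provided the vector length is the product M_h * M_v."""
    assert len(v) == M_h * M_v
    return [v[i * M_v:(i + 1) * M_v] for i in range(M_h)]

x = [[1, 2, 3],
     [4, 5, 6]]
v = mat_to_vec(x)                 # [1, 2, 3, 4, 5, 6]
assert vec_to_mat(v, 2, 3) == x   # the round trip recovers the matrix
```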
Issues:
We lose both the structural properties: translation invariance and
locality.
Too many parameters to learn.
Motivating a Convolutional Layer (Simplified)
Building a Layer with Local Understanding:
In a fully connected layer, every neuron connects to every input, but
images have spatial structure. In convolutional layers,
we build a system that only looks at local
regions in the image, processing small areas at a time. This local
focus helps us capture meaningful
patterns, such as edges or textures, that are essential for
understanding images.
Reducing Parameters with Repeated
Patterns:
Convolutional layers reduce the number of parameters by applying the
same filter across the image. This
filter is like a small window that
slides over the image, examining one small region at a
time. Using a single filter repeatedly
means we don’t need a new set of parameters for every region, making the
network more efficient. It also gives the network a type of
memory, helping it recognize patterns across the
image.
Translation Invariance:
When we apply a filter across the
entire image, the network gains translation
invariance. This means it can recognize patterns, such as the
edge of an object, no matter where they appear in the image. For
instance, if a network learns to detect edges or shapes, it doesn’t
matter if those shapes are in the top left or
bottom right—this invariance lets it recognize objects
regardless of position. As an illustration, let us revisit edge
detection and consider a pelican in flight as shown in following
figures.
Figure (a):
An input and the corresponding output of edge detection.
Figure (b):
Another input and the corresponding output of edge detection.
Observe that Figures (a) and (b) are essentially the same, except
that the position of the pelican in the output depends only on its position in the input. In
other words, the filtering operation's action on the object is generally
independent of the location of the object in the image.
Locality for Efficient Recognition:
We now see further reduction of the parameters by invoking the second
structural property, namely locality.
Convolutional layers assume that nearby
pixels are more relevant to each other than distant ones. In
other words, a pixel \(x_{i,j}\) is
not significantly influenced by far away
pixels. A motivational illustration is in the following figures
consisting of pelicans and seagulls, with each individual bird enclosed
in a red box.
Images of birds to illustrate the property of locality. The pixel
information within each red box is sufficient to understand the
characteristics of the bird inside the box.
Locality implies that if we are
seeking information about a bird, then it is sufficient to know the
pixel information only within the box that is enclosing the bird.
Similarly, at a much finer level, when we seek information about edges or
similar features, it is often enough to
consider 1, 2, or 3 neighboring pixels in each direction,
yielding convolution kernels of size \(3\times
3\), \(5\times 5\), or \(7\times 7\), respectively. These kernels are designed
to recognize details without being influenced
by far-away pixels.
Building Complex Understanding from Simple
Filters:
By stacking multiple convolutional
layers, each layer can capture increasingly complex features. Early layers
might detect simple edges, while later ones recognize parts of objects
or even entire objects.
Here \(b\) is a scalar bias added
element-wise to each element of the matrix \(W\star x\), and
\(S^{[1]}(\cdot)\) is an
activation function (often a ReLU) applied to the result.
Summary: We have seen that at its core, a single
convolutional layer involves the following actions on the input \(x\). First it is convolved with a
convolution kernel \(W\). Then the
result is shifted by a scalar bias \(b\). Finally an activation function \(S^{[1]}(\cdot)\) is applied. Note that when
the input dimension is \(M_h^{[0]} \times
M_v^{[0]}\), the dimension of the output is \[\begin{equation}
M_h^{[1]} \times M_v^{[1]} = (M_h^{[0]} - K_h + 1) \times (M_v^{[0]} - K_v + 1).
\end{equation}\]
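Putting the three actions together, a single convolutional layer can be sketched in plain Python; the function name and the list-of-lists representation are illustrative:

```python
def conv_layer(x, W, b):
    """One convolutional layer: convolve x with kernel W, add the scalar
    bias b, then apply a ReLU activation element-wise."""
    M_h, M_v = len(x), len(x[0])
    K_h, K_v = len(W), len(W[0])
    out = []
    for i in range(M_h - K_h + 1):
        row = []
        for j in range(M_v - K_v + 1):
            z = sum(W[a][c] * x[i + a][j + c]
                    for a in range(K_h) for c in range(K_v)) + b
            row.append(max(0.0, z))   # ReLU activation
        out.append(row)
    return out

# A 3x3 kernel of -1s on a 4x4 image of 1s, with b = 1: each pre-activation
# is -9 + 1 = -8, which the ReLU clips to zero.
a1 = conv_layer([[1] * 4 for _ in range(4)], [[-1] * 3 for _ in range(3)], 1)
assert a1 == [[0.0, 0.0], [0.0, 0.0]]
```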
Exercise: For an illustration of the reduction in
the number of parameters that a convolutional layer has in comparison to
a fully connected layer, consider an example with input dimension \(M_h^{[0]} \times M_v^{[0]} = 224 \times
224\) and a case with kernel dimension \(K_h \times K_v = 3\times 3\). Then the
output dimension is \(M^{[1]}_h \times
M^{[1]}_v = 222 \times 222\).
If we were to seek the same size of output dimension with a
fully connected layer, we have \(222\times 222 = 49,284\) neurons. Since the
input size is \(224\times 224 =
50,176\), the dimension of the weight matrix is the product of
the input size and output size (number of neurons), and together with
the bias vector (one entry for each neuron) we have \(2,472,923,268\) parameters.
In contrast, in the convolutional layer there are only \(3 \times 3 + 1 = 10\) parameters. While on
its own, such a single convolution layer is certainly not as expressive
as the fully connected layer with 2.5 billion learned parameters, as we
see below, combining convolutional layers in tandem yields very powerful
networks with much fewer parameters than their fully connected
counterparts.
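The parameter counts in the exercise can be checked directly:

```python
# Reproducing the parameter counts from the exercise above.
M = 224 * 224   # input size
N = 222 * 222   # output size (number of neurons in the fully connected layer)

fully_connected = M * N + N   # weight matrix plus one bias per neuron
convolutional = 3 * 3 + 1     # 3x3 kernel plus one scalar bias

print(fully_connected)   # 2472923268
print(convolutional)     # 10
```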
Alterations to the Convolution: Padding, Stride, and Dilation
The convolution appearing above is often tweaked and modified in
the context of image data.
Specifically, alterations to the convolution operation, known as
padding, stride, and dilation, are sometimes employed.
We now introduce and illustrate each of these alterations
separately.
Padding
Recall the edge detection example above. Due to the convolution
operation, the output image dimension is
smaller than the input image dimension.
In particular, since the filter dimension \(K_h \times K_v\) is \(3 \times 3\) (Sobel filter), when the input
dimension is \(M^{[0]}_h \times
M^{[0]}_v\), the output dimension is equal to \((M^{[0]}_h -2) \times (M^{[0]}_v -
2)\).
Hence we see a slight reduction of the image size at the output.
Since convolutional neural networks typically consist of several
convolution layers, the dimension reductions in each of these layers can
accumulate, making the overall downstream dimension undesirably
small.
Padding is a simple solution to
overcome this problem by adding extra zero-valued pixels around the
input so that the effective input dimension is higher, and the desired
output dimension is obtained.
In particular, padding is
parameterized by a pair of non-negative integers \((p_h, p_v)\), where \(p_h\) and \(p_v\) respectively denote the number of
all-zero rows and all-zero columns added to the input matrix. An example
of padding is illustrated in the following figure.
Illustration of convolution with padding.
Output dimension after padding:
Again suppose that the input dimension is \(M^{[0]}_h\times M^{[0]}_v\) and the kernel
dimension is \(K _h \times K _v\).
Further, suppose each input image is modified by adding \(p_h\) rows roughly half on the top and half
on the bottom, and \(p_v\) columns
roughly half on the left and half on the right.
Then it is easy to check that the output dimension is \[\begin{equation}
M_h^{[1]} \times M_v^{[1]} = (M^{[0]}_h - K_h + 1 + p_h) \times (M^{[0]}_v - K_v + 1 + p_v).
\end{equation}\]
The setting \((p_h, p_v) = (K_h-1,
K_v-1)\) is a mechanism for ensuring that the input and the
output are of the same dimension.
Also note that typically convolutional neural networks are
designed to have kernels of odd height and odd width. Hence it is common
to pad with exactly \(p_h/2\) rows of
zeros on the top and \(p_h/2\) rows of
zeros on the bottom, and similarly \(p_v/2\) columns of zeros on the left and
on the right. This helps
maintain spatial symmetry while conducting convolutions.
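A minimal zero-padding helper, splitting the \(p_h\) extra rows between top and bottom and the \(p_v\) extra columns between left and right (the function name is ours, for illustration):

```python
def pad(x, p_h, p_v):
    """Add p_h all-zero rows (split top/bottom) and p_v all-zero columns
    (split left/right) around matrix x."""
    M_v = len(x[0])
    top, left = p_h // 2, p_v // 2
    padded = [[0] * (M_v + p_v) for _ in range(top)]
    for row in x:
        padded.append([0] * left + row + [0] * (p_v - left))
    padded += [[0] * (M_v + p_v) for _ in range(p_h - top)]
    return padded

# With (p_h, p_v) = (K_h - 1, K_v - 1) = (2, 2) for a 3x3 kernel, a 2x2
# input becomes 4x4, so the convolution output is again 2x2.
x = [[1, 2],
     [3, 4]]
assert pad(x, 2, 2) == [[0, 0, 0, 0],
                        [0, 1, 2, 0],
                        [0, 3, 4, 0],
                        [0, 0, 0, 0]]
```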
Stride
The convolutions we presented up to now involved shifts of the
convolution kernel by one pixel at a time. This is called a convolution
with a stride of one.
However, in many applications, we may wish to slide the
convolution kernel with bigger steps in order to either reduce the computational cost, or to reduce the dimension of the output of the
convolution layer.
In particular, we can conduct stride with positive integer valued
parameters \((s_h, s_v)\), where \(s_h \geq 1\) and \(s_v \geq 1\) respectively denote the stride
step sizes along height and width. An illustration is provided in the
following figure.
Illustration of convolution with stride.
Output dimension after padding and stride:
With both padding of \((p_h, p_v)\)
and stride of \((s_h, s_v)\), the
output dimension will be \[\begin{equation}
M_h^{[1]} \times M_v^{[1]} = \left(1 + \Big\lfloor \frac{M_h^{[0]} - K_h + p_h}{s_h} \Big\rfloor\right) \times \left(1 + \Big\lfloor \frac{M_v^{[0]} - K_v + p_v}{s_v} \Big\rfloor\right).
\end{equation}\]
Dilation
Dilation is a technique for
increasing the receptive field of a
filter without increasing the number of learned parameters.
This is achieved by spreading
out the elements of the kernel matrix \(W\) via the insertion of zeros between
elements. This alteration allows the kernel to cover a larger area of
the input image, that is, the effective kernel dimension
increases.
The level of dilation (or, the number of zeros added) is
controlled using a pair of two positive integers \((d_h, d_v)\). In particular, it converts a
kernel of size \(K_h \times K_v\) to a
kernel of size \(K_h' \times K_v' =
(d_h(K_h - 1) + 1) \times (d_v(K_v - 1) + 1)\) by adding all-zero columns and all-zero rows between
the columns and rows of the original kernel.
See the following figure for illustration.
Figure (a): No dilation (i.e., \((d_h, d_v) =
(1,1)\)). Figure (b): Dilation with \((d_h, d_v) = (2,2)\). Source: He et al,
2021.
Figure: Illustration of the dilation operation with \((d_h, d_v) = (2,2)\), extending a \(3\times 3\) filter to create a receptive
field of \(5\times 5\) pixels.
Output dimension after padding, stride and
dilation:
The three alterations, namely padding with \((p_h, p_v)\), stride with \((s_h, s_v)\), and dilation with \((d_h, d_v)\), result in an output of
dimension
\[\begin{equation}
M_h^{[1]} \times M_v^{[1]}
=
\left(1 + \Big\lfloor \frac{M_h^{[0]} - d_h(K_h-1)-1 + p_h}{s_h}
\Big\rfloor \right) \times \left(1 + \Big\lfloor \frac{M_v^{[0]} -
d_v(K_v-1)-1 + p_v}{s_v} \Big\rfloor\right),
\end{equation}\]\(\qquad\)where
\(\lfloor u \rfloor\) represents the
largest integer not greater than \(u\).
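The formula can be wrapped in a small helper (the function name `out_dim` is ours) and sanity-checked against the special cases discussed above; it applies per axis:

```python
from math import floor

def out_dim(M, K, p=0, s=1, d=1):
    """Output dimension along one axis for input size M, kernel size K,
    padding p, stride s, and dilation d."""
    return 1 + floor((M - d * (K - 1) - 1 + p) / s)

# With no alterations the formula reduces to M - K + 1:
assert out_dim(224, 3) == 222
# "Same" padding with p = K - 1 preserves the dimension:
assert out_dim(224, 3, p=2) == 224
# Stride 2 roughly halves the dimension:
assert out_dim(224, 3, s=2) == 111
# Dilation d = 2 makes a 3x3 kernel act like a 5x5 kernel:
assert out_dim(7, 3, d=2) == out_dim(7, 5)
```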
Inputs with Multiple Channels
So far in this section we have looked at the case where each
input is a matrix, usually representing a grey scale image.
However, convolutional networks often deal with inputs comprised
of multiple channels. For instance, a
color image has three channels representing the red, green, and
blue components.
When we have such data with multiple channels, input to a
convolution layer is no longer a matrix but is rather represented as a
three dimensional tensor of dimension, say, \(M^{[0]}_c \times M^{[0]}_h \times
M^{[0]}_v\), where the depth \(M^{[0]}_c\) denotes the number of
channels, and the other two numbers are for the horizontal and vertical
dimensions as used previously.
Hence for color images we use \(M^{[0]}_c = 3\) and further, as we describe
in the sequel, for a hidden layer we often
have more than \(3\) input
channels to the layer.
To deal with inputs with multiple channels, we use a kernel \(W\) with the same depth as the input. That is, we use a
kernel of dimension \(K_c \times K_h \times
K_v\) such that \(K_c =
M^{[0]}_c\), \(K_h \leq
M^{[0]}_h\), and \(K_v \leq
M^{[0]}_v\).
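Multi-channel convolution can be sketched in plain Python: the kernel has the same depth as the input, the products are summed over all channels, and the output is again a matrix (names and representation are illustrative):

```python
def conv_multichannel(x, W):
    """Convolve a multi-channel input x (M_c x M_h x M_v) with a kernel W
    (K_c x K_h x K_v, with K_c = M_c), summing over channels."""
    M_c, M_h, M_v = len(x), len(x[0]), len(x[0][0])
    K_h, K_v = len(W[0]), len(W[0][0])
    assert len(W) == M_c   # kernel depth must match input depth
    return [[sum(W[c][a][b] * x[c][i + a][j + b]
                 for c in range(M_c)
                 for a in range(K_h) for b in range(K_v))
             for j in range(M_v - K_v + 1)]
            for i in range(M_h - K_h + 1)]

x = [[[1] * 4 for _ in range(4)] for _ in range(3)]   # 3 x 4 x 4 input
W = [[[1] * 3 for _ in range(3)] for _ in range(3)]   # 3 x 3 x 3 kernel
# Each output entry sums 3 channels x 9 window entries = 27 products.
assert conv_multichannel(x, W) == [[27, 27], [27, 27]]
```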
Illustration of convolution when the input has multiple channels.
Outputs with Multiple Channels
Until now, regardless of the number of input channels, the output
of a convolution layer is a matrix,
denoted via \(a^{[1]}\). This is
because, so far there is only one
kernel, possibly a tensor, operating on the input to the
convolution layer.
However, most popular convolutional neural networks have
convolutional layers with multiple
kernels operating on the input simultaneously. In this case, the
output of the layer is a collection of matrices denoted by \(a^{[1,j]}\) for \(j=1, \ldots, M_c^{[1]}\), where \(M_c^{[1]}\) is the number of output channels. Consequently, the
output can be viewed as a 3-dimensional tensor of dimension \(M^{[1]}_c \times M^{[1]}_h \times
M^{[1]}_v\).
The following figure illustrates this.
Illustration of convolution when multiple kernels are used in parallel
(i.e., multiple output channels).
Building a Convolutional Neural Network
We have now acquired all the crucial elements necessary for
constructing convolutional neural networks, such as the VGG19 model
depicted earlier. We now put the pieces together to construct a
convolutional neural network that, in addition to convolutional layers,
includes fully connected layers and pooling layers.
A convolutional neural network is generally deep
with multiple layers, similar to feed-forward networks. However, unlike
feed-forward networks which consist of only fully connected layers,
convolutional neural networks have different types of layers, of which
some are trainable and the others are
not, and the trainable layers are further broken up
into convolutional layers and dense layers.
Here \(L_{\text{train}}\)
counts the number of trainable layers, whereas \(L_{\text{pool}}\) counts the number of
layers that do not have trainable parameters. Further, the trainable
layers are either convolutional layers, counted by \(L_{\text{conv}}\), or fully connected
layers, counted by \(L_{\text{dense}}\), so that \(L_{\text{train}} = L_{\text{conv}} + L_{\text{dense}}\).
Similar to a feed-forward network, the goal of a convolutional
neural network is to approximate some
unknown function \(f^*(\cdot)\). For
instance, for classification of image data with animal faces, the
function value \(f^*(x)\) for any given
image \(x\) may yield a probability vector with the highest weight on
the index associated with the label of the image \(x\).
A convolutional neural network defines a mapping \(f_{\theta}(\cdot)\) and learns the values of the unknown parameters \(\theta\) that ideally result
in \(f^*(x) \approx f_\theta(x)\) for
as many input images \(x\) as
possible.
In general, similar to feed-forward networks, the approximating
function \(f_\theta(\cdot)\) is
recursively composed as
\[
f_{\theta}(x)=f_{\theta^{[L]}}^{[L]}(f_{\theta^{[L-1]}}^{[L-1]}(\ldots
(f_{\theta^{[1]}}^{[1]}(x))\ldots)),
\]\(\qquad\)where for each
\(\ell\), the function \(f_{\theta^{[\ell]}}^{[\ell]}(\cdot)\)
is associated with the \(\ell\)th layer
which depends on the layer’s parameters \({\mathbb{\theta}}^{[\ell]} \in
\Theta^{[\ell]}\).
Note that for layers that are not
trainable (as counted via \(L_{\text{pool}}\)), the parameter space
\(\Theta^{[\ell]}\) is empty.
Similarly to feed-forward networks, it is useful to denote the
neuron activations of the network via \(a^{[1]},a^{[2]} \ldots, a^{[L]}\) where
\(a^{[L]} = \hat{y}\) is the output,
and for \(\ell=1,\ldots,L-1\), \[
a^{[\ell]} = f^{[\ell]}_{\theta^{[\ell]}}(a^{[\ell-1]}),
\] with \(a^{[0]} = x\).
Convolutional Layers
When the \(\ell\)-th layer of
the network is a convolution layer, \(f_{\theta^{[\ell]}}^{[\ell]}(\cdot)\)
takes the output \(a^{[\ell-1]}\) of the
previous layer as the input and conducts convolution.
In this case, the input and output are generally 3-dimensional tensors, as we have seen
previously.
Pooling Layers
As mentioned above, there are also non-trainable layers counted by \(L_{\text{pool}}\) and these are typically
called pooling layers. The main idea of
a pooling layer is to reduce the height
and width of the input tensor \(a^{[\ell-1]}\) to achieve a lower
dimensional output tensor \(a^{[\ell]}\) without changing the
depth.
Generally, for some fixed channel \(j\) and pixel coordinates \((i,k)\) of the output, a pooling operation operates
on pixels from a window in the input denoted via \(\mathscr{I}_{(i,k)}\). Here \(\mathscr{I}_{(i,k)}\) is a set of pixel
coordinates in the input that are mapped to the specific output pixel
\((i,k)\). There are two popular
pooling techniques used in practice, namely max-pooling and average-pooling.
For each channel \(j\), the
pooling operation can be summarized as \[\begin{equation}
a^{[\ell,j]}_{i,k} = \max_{(i',k') \in \mathscr{I}_{(i,k)}} a^{[\ell-1,j]}_{i',k'}
\qquad \text{or} \qquad
a^{[\ell,j]}_{i,k} = \frac{1}{|\mathscr{I}_{(i,k)}|} \sum_{(i',k') \in \mathscr{I}_{(i,k)}} a^{[\ell-1,j]}_{i',k'},
\end{equation}\]\(\qquad\)for max-pooling and average-pooling, respectively.
As is evident, max-pooling takes the maximal pixel value within the
window as the output, while average pooling averages pixel values within
the window for the output. See the following figure for an
illustration.
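Both pooling operations can be sketched in plain Python for non-overlapping windows; the function name and the `mode` flag are ours, for illustration:

```python
def pool(x, k_h, k_v, mode="max"):
    """Pool matrix x with non-overlapping k_h x k_v windows.

    mode="max" takes the maximal value in each window; mode="avg" averages.
    """
    M_h, M_v = len(x), len(x[0])
    out = []
    for i in range(0, M_h - k_h + 1, k_h):
        row = []
        for j in range(0, M_v - k_v + 1, k_v):
            window = [x[i + a][j + b] for a in range(k_h) for b in range(k_v)]
            row.append(max(window) if mode == "max"
                       else sum(window) / len(window))
        out.append(row)
    return out

x = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [1, 1, 2, 2],
     [1, 1, 2, 2]]
assert pool(x, 2, 2, "max") == [[4, 8], [1, 2]]
assert pool(x, 2, 2, "avg") == [[2.5, 6.5], [1.0, 2.0]]
```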
Note: The idea of pooling interplays with the notion
that the initial layers of a convolutional network
focus on pixel level features similar to edge
detection, and as we progress towards the final layers of the network,
the information is aggregated to address general
questions about the whole image. Thus deeper layers are less sensitive to translation changes on the
input image compared to the initial layers. For instance, the answer to
the question ``is there a bird in the photo?'' is the
same irrespective of whether it is a pelican or a seagull, even though the
corresponding outputs from the initial layers look different. Pooling
layers are applied after convolutional layers to support this
aggregation by progressively reducing the spatial
dimensions, allowing the network to recognize larger patterns
while focusing less on specific pixel locations.
Fully Connected Layers
Some layers of a CNN can be fully
connected. Such layers are typically deployed at the end of the
network. This is because the typical task of the last layers of
convolutional neural network is to address general questions, such as
classification of the objects in the
image.
Note that since fully connected layers operate on vectors as the
input, in case the previous layer has a tensor as output, it is flattened to a vector.
Practice Exercise
Practical 2:
Open Tutorial 2 of CNN available on the workshop GitHub page.
Alternatively, click
here.
Save a copy of this in your Google Colab.
In this exercise, we build a simple CNN for classification using
the CIFAR-10 dataset.
VGG19 Revisited
We now take a closer look at the architecture of our running
example network, VGG19.
While this is not the most modern convolutional architecture, it
is instructive to consider it here since it falls directly within the
paradigms discussed above.
The table below provides complete details.
The total number of learned parameters of the VGG19 network is around
144 million.
Dropout
Dropout is another popular technique for
improving the performance of CNNs. One simple tip for using dropout is
to include it after each fully connected
layer in the network. This helps to regularize the network and
prevent overfitting, which can improve the generalization performance on
new data.
Dropout is typically not used after convolutional layers in CNNs,
but rather after fully connected layers. In convolutional layers, the
neurons are typically spatially arranged in a grid, and dropping out
individual neurons could disrupt the spatial
structure of the feature maps. However, some researchers have
explored alternative forms of spatial dropout that are designed
specifically for convolutional layers, so it’s an area of ongoing
research.
Note: It’s also important to tune the hyperparameter of dropout, which is
the dropout probability, to find the optimal value for your specific
network and dataset. Additionally, dropout can sometimes be sensitive to
the network architecture and the order in which layers are connected, so
it’s worth experimenting with different architectures and layer
connections to see if performance improves. Finally, it’s important to
note that dropout can sometimes be omitted in small CNNs or shallow
networks, as the benefits may not be significant compared to the added
computational overhead.
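As a minimal sketch of the idea (this shows the common "inverted" dropout formulation, not code from the tutorials): during training each activation is zeroed with probability \(p\) and survivors are rescaled by \(1/(1-p)\) so the expected activation is unchanged, while at inference dropout is a no-op.

```python
import random

def dropout(activations, p, training=True):
    """Inverted dropout on a list of activations (illustrative sketch).

    During training, each activation is dropped with probability p and
    survivors are scaled by 1 / (1 - p); at inference nothing happens.
    """
    if not training or p == 0.0:
        return list(activations)
    return [0.0 if random.random() < p else a / (1.0 - p)
            for a in activations]

random.seed(0)                         # for reproducibility of the sketch
out = dropout([1.0] * 1000, 0.5)
# Every surviving activation is scaled to 2.0; the rest are zeroed.
assert all(v in (0.0, 2.0) for v in out)
# At inference the activations pass through unchanged.
assert dropout([1.0, 2.0], 0.5, training=False) == [1.0, 2.0]
```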
Batch Normalization
Similar to feed-forward networks, batch normalization is a
popular technique for improving the performance of CNNs.
One simple tip for using batch normalization is to include it
after each convolutional or fully connected layer in the network. This
helps to normalize the output of each
layer and reduce the internal covariate shift, and this can
improve the overall stability and speed of training.
Note: Batch normalization can sometimes be sensitive
to the batch size used during training, so it’s worth experimenting with
different batch sizes to see if performance improves. It is important to
note that batch normalization can sometimes be omitted in small CNNs or
shallow networks, as the benefits may not be significant compared to the
added computational overhead.
Understanding Inner Layers and Derived Features
“Visualizing and Understanding Convolutional
Networks” is a paper by Matthew Zeiler and Rob Fergus, which
proposes a method for visualizing and understanding the internal
representations learned by CNNs.
The authors start by highlighting the importance of understanding
how CNNs work. Specifically, they focus on the visualization of the
activation patterns in the feature maps produced by
each layer of the network. The following figure provides some
understanding of the features between the layers. For more details refer
to the original paper.
The network has many channels across multiple layers, and here we
present only a few of those channels, focusing on a pair of arbitrary
channels within each of the layers \(2\), \(3\), \(4\), and \(5\).
Each channel that we visualize has a \(3 \times 3\) grid of synthesized images
(channel visualization) as well as a matching \(3 \times 3\) grid of parts of images from a
dataset (original receptive field). These channel visualizations and
original receptive fields can serve as a visual interpretation of what the specific
channel detects.
For example, we see that the two channels visualized in
layer \(2\) detect simple features, with one channel focusing on
edges and another channel focusing on circles.
As we advance deeper in the network, we see that the types of
visual patterns detected are much more complex. For example, the two
channels presented for layer \(4\) detect parts of animals, and the
channels of layer \(5\) detect such
representations as well.
Note, however, that one of the channels in layer \(5\) that we present appears to detect either faces or car wheels, even though these
are very different objects. Hence any attempt to categorize channels
based on their ``meaning'' alone is far from absolute.
Nevertheless, a visual representation such as this Figure can
help to understand the function of individual channels within the
network.
In summary, early layers of the network
tend to detect simple features such as edges and textures, while
later layers detect more complex features such as object parts and
entire objects.
Other Landmark Architectures
In addition to the VGG19 architecture, which achieved high accuracy in the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, there
are a few other landmark architectures. The following list is a small
summary of these models.
ResNet (Residual Network): ResNet is a CNN
architecture that introduced residual blocks, which allow for the reuse
of feature maps from earlier layers by adding skip connections. This
enables very deep networks (e.g. 152 layers) to be trained and achieve
state-of-the-art performance on image classification tasks.
A shortcut connection (residual connection) as part of a residual
network.
Inception (GoogLeNet): Inception is a CNN
architecture that introduced the Inception module, which uses multiple
filter sizes in parallel to extract features at different scales. This
allows for efficient use of computational resources and achieves high
accuracy on image classification tasks.
One form of an inception module, playing part in an inception network.
The key idea is parallel computation of various paths followed by a
concatenation of the outputs from all paths.
The Inception network is a type of convolutional neural network
architecture that was introduced in 2014 by Google
researchers.
The main idea behind the Inception network is to use multiple filter sizes in parallel at each
layer of the network, in order to capture features of different scales
and resolutions.
This approach is different from other popular CNN architectures,
such as VGG, which typically use small (3x3)
filters in all of their convolutional layers.
The Inception network also includes the use of so-called bottleneck layers, which use 1x1 convolutions to reduce the
dimensionality of the input before applying larger filters. This helps
to reduce the number of parameters in the network, while still allowing
it to capture complex features.
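The saving from such a bottleneck is easy to quantify. With hypothetical channel counts (256 input channels, a 64-channel bottleneck) and a 5x5 filter, ignoring biases:

```python
# Parameter count of a 5x5 convolution applied directly to 256 channels,
# producing 256 channels, versus a 1x1 bottleneck to 64 channels first.
# Channel counts here are illustrative, not from a specific network.
direct = 5 * 5 * 256 * 256
bottleneck = 1 * 1 * 256 * 64 + 5 * 5 * 64 * 256

print(direct)       # 1638400
print(bottleneck)   # 425984
```

The bottleneck version needs roughly a quarter of the parameters while still applying a 5x5 filter to the (reduced) representation.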
One of the key innovations of the Inception network is its use of
an Inception module, which is a basic
building block of the network that combines the various filter sizes and
bottleneck layers. The Inception module allows the network to capture
complex and multi-scale features at each layer, while also reducing the
number of parameters in the network.
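The parameter saving from a 1x1 bottleneck is easy to check by hand. The channel counts below are illustrative choices (loosely inspired by GoogLeNet's early modules), not values taken from the text:

```python
# 5x5 convolution mapping 192 input channels to 32 output channels (biases ignored)
direct = 5 * 5 * 192 * 32                         # 153600 parameters

# same mapping via a 1x1 bottleneck that first reduces 192 channels to 16
bottleneck = 1 * 1 * 192 * 16 + 5 * 5 * 16 * 32   # 3072 + 12800 = 15872 parameters

print(direct / bottleneck)                        # roughly a 10x reduction
```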
MobileNet: MobileNet is a CNN architecture designed
for mobile and embedded devices that uses depthwise separable
convolutions to reduce the number of parameters and computations
required while maintaining high accuracy. It achieves state-of-the-art
performance on mobile platforms and real-time video analysis.
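A quick parameter count shows where the saving of depthwise separable convolutions comes from; the kernel and channel sizes here are arbitrary illustrative choices:

```python
k, c_in, c_out = 3, 64, 128    # kernel size, input channels, output channels

# standard convolution: every output channel mixes all input channels
standard = k * k * c_in * c_out           # 73728 parameters

# depthwise separable: a per-channel k x k filter, then a 1x1 pointwise mix
separable = k * k * c_in + c_in * c_out   # 576 + 8192 = 8768 parameters

print(standard / separable)               # ~8.4x fewer parameters
```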
DenseNet: DenseNet is a CNN architecture that
introduces dense connectivity between layers, where each layer receives
feature maps from all preceding layers. This promotes feature reuse and
reduces the number of parameters required, leading to high accuracy and
efficient training.
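Dense connectivity can be sketched with feature concatenation. The sketch below uses plain matrix products instead of convolutions, and all sizes are hypothetical:

```python
import numpy as np

def dense_block(x, num_layers, growth, rng):
    # Each layer sees the concatenation of the input and all earlier outputs,
    # and contributes `growth` new feature channels of its own.
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=-1)
        w = rng.standard_normal((inp.shape[-1], growth)) * 0.1
        features.append(np.maximum(inp @ w, 0.0))
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x = np.ones((1, 16))
out = dense_block(x, num_layers=4, growth=8, rng=rng)
# channels grow linearly: 16 + 4 * 8 = 48
```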
AlexNet: AlexNet is a CNN architecture that was the
winner of the ILSVRC 2012. At the time, it was a
breakthrough in image classification, achieving state-of-the-art
performance on the ImageNet dataset. It popularized the use of
ReLU activation functions, dropout
regularization, and data augmentation for training deep neural
networks. It also used GPU acceleration to speed up training and
achieved a significant reduction in error rate on image classification
tasks.
EfficientNet: EfficientNet is a family of convolutional neural
network architectures that were developed with the aim of providing
better accuracy and efficiency in terms of
model size and computation cost. The key idea is to systematically scale up the dimensions of the
network’s parameters (such as depth, width, and resolution) in a
balanced way, while also introducing a new compound scaling method that
optimizes these dimensions based on a set of pre-defined
constraints.
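Compound scaling multiplies depth, width, and resolution by fixed ratios raised to a common exponent phi. The coefficients below are the commonly quoted values from the original EfficientNet paper; treat them as an illustration of the arithmetic rather than a prescription:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution ratios

# the ratios are chosen so that one scaling step roughly doubles the FLOPs:
budget = alpha * beta**2 * gamma**2   # ~1.92, close to 2

phi = 2                               # scale up two steps
depth_mult = alpha**phi               # ~1.44x more layers
width_mult = beta**phi                # ~1.21x more channels
res_mult = gamma**phi                 # ~1.32x higher input resolution
```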
Performance of various convolutional models, including EfficientNet.
Beyond Classification
So far we have focused on the internals of convolutional neural
networks with a particular emphasis on the task of image
classification, e.g. determining if an image is that of a cat
or a dog.
However, there are several other important image analysis tasks
that are also handled with convolutional neural networks. These tasks
deal with the analysis and understanding of an image, including locating
objects, counting objects, separating different semantic regions of the
image, and more.
Note: In terms of the input data, it is important to
keep in mind that not all data is made of monochrome or color images.
Within computer vision, one often deals with image
sequences (short movies), or images that have more than 3
channels. For example, some images may also have a distance
channel capturing the distance from the camera per pixel.
Further, non-image data can also be handled via convolutional networks.
One such example is fMRI (functional magnetic resonance
imaging) data which is 4 dimensional in nature as it records
the state of physical locations (e.g., blood-oxygen-level-dependent
(BOLD) signal) in three dimensions over time.
Convolutional Networks and Key Computer Vision Tasks
As mentioned above, classification serves as a simple and useful
example. For an input image \(x\), a
convolutional neural network \(f_\theta(\cdot)\) has output \(\hat{y} = f_\theta(x)\), which is a
vector of probabilities; the highest probability typically determines the
appropriate label for the image.
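Picking the label from the output probability vector is just an argmax; the class names and probabilities below are made up for illustration:

```python
import numpy as np

y_hat = np.array([0.92, 0.08])   # hypothetical network output for (cat, dog)
labels = ["cat", "dog"]
prediction = labels[int(np.argmax(y_hat))]
print(prediction)                # prints "cat"
```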
As was evident from our detailed study of the VGG19 model and
other architectures, initial layers of the model \(f_\theta(\cdot)\) are typically
convolutional, and the final
layers are typically fully connected
layers.
These final layers help transform the internal
derived features in the network into the output vector of probabilities
\(\hat{y}\).
When one considers tasks other than classification, it is often
common to replace the final layers of the network with
other layers such that the output \(\hat{y}\) suits the desired
task.
The following figure illustrates key computer vision tasks for
images.
Object localization, which is the task of
identifying the location of an object in an image, as well as possibly
the type of object in which case the task is called localization
and classification.
Object detection, which is the task of detecting
multiple instances of an object in an image, also separating between the
objects and classifying their type.
Landmark detection, which is the task of
identifying the specific pixel locations of landmarks in an
image.
Semantic segmentation, which is the process of
classifying each individual pixel to be of a different class from a
finite set of classes (pixel wise classification).
Instance segmentation, which finds different
instances of objects in the image and separates pixels to be of
different instances.
Identification (face recognition), which
determines if an image is that of a specific instance (or
person).
Let us now consider possible forms of the output \(\widehat{y}\).
For object localization, \(\widehat{y}\) needs to contain information
about a bounding box which locates the object. This can
be in the form of \((\widehat{y}_x,\widehat{y}_y,\widehat{y}_h,\widehat{y}_w)\)
where \(\widehat{y}_x\) and \(\widehat{y}_y\) are the coordinates of
(say) the upper left corner of the bounding box and \(\widehat{y}_h\), \(\widehat{y}_w\) are the height and width of
the bounding box, respectively.
For object detection, a
collection of multiple bounding boxes needs to be supplied.
For landmark detection, a list
of coordinates of the locations of landmarks comprises the
output.
For semantic segmentation, each
pixel location in the input image, \(x\), has an associated probability vector
of classes in the output \(\widehat{y}\). Hence in this case, \(\widehat{y}\) can be represented as a
tensor with width and height dimensions the same as the input image, and
a depth dimension which is the number of classes in the
segmentation.
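The segmentation output tensor can be illustrated with random logits; the image size and class count are toy values:

```python
import numpy as np

H, W, K = 4, 4, 3                          # toy image size, number of classes
rng = np.random.default_rng(0)
logits = rng.standard_normal((H, W, K))    # raw network output tensor

# softmax over the depth (class) dimension yields a probability vector per pixel
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

# a hard segmentation map assigns each pixel its most probable class
seg_map = probs.argmax(axis=-1)            # shape (H, W)
```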
For instance segmentation, the
output is similar to that of semantic segmentation, but instead of
recording probabilities of classes, the depth dimension of the output
\(\widehat{y}\) is used for determining
the specific instance of any given pixel.
Finally, in the case of identification,
or face recognition, the output is often just a probability as in
a binary classifier, since the task is to determine if a face image
matches a given pre-stored template or not. The special aspect in this case
is that the input \(x\) is typically composed of two
images, where one is the template of the person (e.g. a stored
image in a security database), and the other image is the
current image taken.
Object Localization
To get a feel for object localization assume that we wish to
train a CNN that operates on an input image \(x\) and determines if the image contains a
\(\texttt{bird}\) or a \(\texttt{plane}\) (classification).
The model’s second goal is to determine the specific
location \((\widehat{y}_x,\widehat{y}_y,\widehat{y}_h,\widehat{y}_w)\)
of that object (localization). Images with multiple birds or planes are
not considered. Images without a bird and without a plane are possible
and in this case the output yields \(\texttt{nothing}\).
One way to encode the output is \[
\widehat{y}=(\widehat{p}_{\texttt{nothing}},
~\widehat{p}_{\texttt{bird}}, ~\widehat{p}_{\texttt{plane}},
\widehat{y}_x,\widehat{y}_y,\widehat{y}_h,\widehat{y}_w),
\]
where as in standard classification examples \((\widehat{p}_{\texttt{nothing}},
~\widehat{p}_{\texttt{bird}}, ~\widehat{p}_{\texttt{plane}})\) is
a probability vector, and the other coordinates define a bounding
box.
Here an output that has \(\widehat{p}_{\texttt{nothing}}\) greater
than each of \(\widehat{p}_{\texttt{bird}}\) and \(\widehat{p}_{\texttt{plane}}\) implies a
prediction of no bird and no plane.
In terms of training data, for each input image we denote the
output as \(y\), where images without a
bird or a plane are labeled as \(y = (1, 0,
0, \emptyset,\emptyset,\emptyset,\emptyset)\), where \(\emptyset\) are ``do not care'' values.
Images with a bird are labeled as \(y =
(0,1,0, {y}_x, {y}_y, {y}_h,{y}_w)\) where the bounding box \(({y}_x, {y}_y, {y}_h,{y}_w)\) is typically
based on a manual determination by a human annotator. Similarly, images
with a plane are labeled as \(y = (0,0,1,
{y}_x, {y}_y, {y}_h,{y}_w)\).
We now construct a loss function
that captures closeness of \(\widehat{y}\) and \(y\). For this we first separate the
classification and localization objectives into a loss \(C_{\text{classification}}(\theta \,;\,
\widehat{y}, y)\) and \(C_{\text{localization}}(\theta \,;\, \widehat{y},
y)\). The former depends only on the probability components in
\(\widehat{y}\) and \(y\), and the latter depends only on the
bounding box components in \(\widehat{y}\) and \(y\). For the classification loss, we use categorical cross entropy. For the localization loss, we use a mean squared error, applied to the four
bounding box components.
The two separate losses are then combined such that the loss for a
specific observation is,
\[
C(\theta \,;\, \widehat{y}, y) = C_{\text{classification}}(\theta \,;\, \widehat{y}, y) + \gamma \, (1-y_1) \, C_{\text{localization}}(\theta \,;\, \widehat{y}, y),
\]
where \(\gamma > 0\) is a
hyper-parameter used to weigh the two losses, taken as \(\gamma=1\) by default. Observe that \(y_1 = 1\) when the label is \(\texttt{nothing}\) and is otherwise \(0\), and thus for labels in the training
data without a bird or a plane, only the classification objective is
used.
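The combined loss described above can be written directly in NumPy. This is a minimal sketch: the `None` entries stand in for the "do not care" values, and the numbers in the example labels are made up:

```python
import numpy as np

def combined_loss(y_hat, y, gamma=1.0):
    p_hat = np.asarray(y_hat[:3], dtype=float)
    p = np.asarray(y[:3], dtype=float)
    c_cls = -np.sum(p * np.log(p_hat + 1e-12))   # categorical cross entropy
    if y[0] == 1:                                # "nothing": box entries are don't-care
        return c_cls
    box_hat = np.asarray(y_hat[3:], dtype=float)
    box = np.asarray(y[3:], dtype=float)
    c_loc = np.mean((box_hat - box) ** 2)        # MSE on the four box components
    return c_cls + gamma * c_loc

# an image labeled as a bird, with a manually annotated bounding box
y = (0, 1, 0, 0.20, 0.30, 0.40, 0.25)
y_hat = (0.05, 0.90, 0.05, 0.22, 0.28, 0.41, 0.24)
loss = combined_loss(y_hat, y)

# an image with neither a bird nor a plane: only the classification term counts
y_nothing = (1, 0, 0, None, None, None, None)
loss_nothing = combined_loss((0.7, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0), y_nothing)
```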
To perform object localization, say with a model like
VGG19, the network can be modified by adding additional
layers at the end of the architecture to predict the coordinates of the
bounding box. This can be achieved by attaching a regression head to the output of the
final convolutional layer of the network. The regression head
consists of fully connected layers that predict the coordinates of the
bounding box. Such simple modifications of networks that were otherwise
designed for classification are generally straightforward.
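A two-headed output can be sketched as follows. The feature dimension of 512 and the random weights are placeholders for a trained backbone, not actual VGG19 values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(512)        # stand-in for the backbone's final features

# classification head: probabilities for (nothing, bird, plane)
w_cls = rng.standard_normal((512, 3)) * 0.01
p_hat = softmax(features @ w_cls)

# regression head: the four bounding-box components, attached alongside
w_box = rng.standard_normal((512, 4)) * 0.01
box_hat = features @ w_box

y_hat = np.concatenate([p_hat, box_hat])   # the 7-component output vector
```

In practice both heads would be trained jointly with the combined classification and localization loss.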
Practice Exercise
Practical 3:
Open Tutorial 3 of CNN available on the workshop GitHub page.
Alternatively, click
here.
Save a copy of this in your Google Colab.
In this exercise, we build a simple CNN for classification using
the MNIST dataset.